# Multimodal Video Understanding

## Cosmos Reason1 7B GGUF
- **Maintainer:** unsloth · **License:** Other · **Downloads:** 6,690 · **Likes:** 1
- **Tags:** Text-to-Video, Transformers, English

Cosmos-Reason1 is a Physical AI model developed by NVIDIA, capable of understanding physical common sense and generating embodied decisions in natural language through long-chain reasoning. This entry packages the 7B model in GGUF format for llama.cpp-compatible runtimes.

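Since this entry ships GGUF weights, a quantization can be pulled locally with `huggingface_hub`. A minimal sketch; the repo id and quant filename are assumptions, so check the unsloth model page for the files it actually ships:

```python
# Sketch: download a GGUF quantization of Cosmos-Reason1-7B for use with a
# llama.cpp-compatible runtime. Both the repo id and the filename below are
# assumptions; check the unsloth model page for the quantizations it ships.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="unsloth/Cosmos-Reason1-7B-GGUF",   # assumed repo id
    filename="Cosmos-Reason1-7B-Q4_K_M.gguf",   # assumed quant filename
)
print(path)  # local cache path, ready to hand to llama.cpp / llama-cpp-python
```
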
## Cosmos Reason1 7B
- **Maintainer:** nvidia · **License:** Other · **Downloads:** 18.56k · **Likes:** 72
- **Tags:** Transformers, English

Cosmos-Reason1 is a Physical AI model developed by NVIDIA, capable of understanding physical common sense and generating embodied decisions through long-chain reasoning.

## Anon
- **Maintainer:** aiden200 · **License:** Apache-2.0 · **Downloads:** 361 · **Likes:** 0
- **Tags:** English

A fine-tuned version of the lmms-lab/llava-onevision-qwen2-7b-ov model, supporting video-text-to-text tasks.

## Internvideo2 Stage2 6B
- **Maintainer:** OpenGVLab · **License:** MIT · **Downloads:** 542 · **Likes:** 0
- **Tags:** Video-to-Text, Safetensors

InternVideo2 is a multimodal video understanding model with 6B parameters, focusing on video content analysis and comprehension tasks.

## Qwen2.5 VL 72B Instruct Pointer AWQ
- **Maintainer:** PointerHQ · **License:** Other · **Downloads:** 5,592 · **Likes:** 8
- **Tags:** Image-to-Text, Transformers, English

Qwen2.5-VL is the latest vision-language model in the Qwen family, featuring enhanced visual understanding, agent capabilities, and structured output generation. This entry is an AWQ-quantized build of the 72B Instruct model.

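As a sketch of how an AWQ build like this is typically consumed: Qwen2.5-VL checkpoints load through the dedicated classes that transformers 4.49 added, with the `qwen-vl-utils` helper preparing vision inputs and AutoAWQ supplying the quantized kernels. The repo id and image path below are assumptions:

```python
# Sketch: load an AWQ-quantized Qwen2.5-VL with transformers (>=4.49) and run
# single-image inference. Repo id and image path are assumptions, not verified.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

repo = "PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ"  # assumed repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    repo, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(repo)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/frame.jpg"},  # placeholder
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[prompt], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding, keeping only the generated answer.
answer = processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```
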
## VL3 SigLIP NaViT
- **Maintainer:** DAMO-NLP-SG · **License:** Apache-2.0 · **Downloads:** 25.55k · **Likes:** 8
- **Tags:** Text-to-Image, Transformers, English

The vision encoder for VideoLLaMA3, using Any-resolution Vision Tokenization (AVT) to dynamically process images and videos at different resolutions.

## Videollama2.1 7B 16F Base
- **Maintainer:** DAMO-NLP-SG · **License:** Apache-2.0 · **Downloads:** 179 · **Likes:** 1
- **Tags:** Video-to-Text, Transformers, English

VideoLLaMA2.1 is an upgraded version of VideoLLaMA2, focusing on enhancing spatiotemporal modeling and audio understanding capabilities in large video-language models.

## Videollama2.1 7B 16F
- **Maintainer:** DAMO-NLP-SG · **License:** Apache-2.0 · **Downloads:** 2,813 · **Likes:** 10
- **Tags:** Text-to-Video, Transformers, English

VideoLLaMA 2 is a multimodal large language model focused on video understanding, equipped with spatiotemporal modeling and audio comprehension capabilities.

## Videollama2 72B
- **Maintainer:** DAMO-NLP-SG · **License:** Apache-2.0 · **Downloads:** 26 · **Likes:** 10
- **Tags:** Text-to-Video, Transformers, English

VideoLLaMA 2 is a multimodal large language model focused on video understanding and spatiotemporal modeling; it supports video and image inputs and can perform visual question answering and dialogue tasks.

## Tarsier 34b
- **Maintainer:** omni-research · **License:** Apache-2.0 · **Downloads:** 103 · **Likes:** 17
- **Tags:** Video-to-Text, Transformers

Tarsier-34b is an open-source large-scale video-language model focused on generating high-quality video captions, achieving leading results on multiple public benchmarks.

## Videollama2 8x7B Base
- **Maintainer:** DAMO-NLP-SG · **License:** Apache-2.0 · **Downloads:** 20 · **Likes:** 2
- **Tags:** Text-to-Video, Transformers, English

VideoLLaMA 2 is a next-generation video large language model, focusing on enhancing spatiotemporal modeling and audio understanding capabilities, supporting multimodal video question answering and description tasks.

## Videollama2 8x7B
- **Maintainer:** DAMO-NLP-SG · **License:** Apache-2.0 · **Downloads:** 21 · **Likes:** 3
- **Tags:** Text-to-Video, Transformers, English

VideoLLaMA 2 is a multimodal large language model focused on video understanding and audio processing, capable of handling video and image inputs to generate natural language responses.

## Llava NeXT Video 34B Hf
- **Maintainer:** llava-hf · **Downloads:** 2,232 · **Likes:** 7
- **Tags:** Text-to-Video, Transformers, English

LLaVA-NeXT-Video is an open-source multimodal chatbot trained on mixed video and image data, excelling at video understanding.

## Llava NeXT Video 7B DPO Hf
- **Maintainer:** llava-hf · **Downloads:** 12.61k · **Likes:** 9
- **Tags:** Video-to-Text, Transformers, English

LLaVA-NeXT-Video is an open-source multimodal chatbot optimized through mixed training on video and image data, with strong video understanding capabilities. This variant is further tuned with DPO (direct preference optimization).

## Llava NeXT Video 7B Hf
- **Maintainer:** llava-hf · **Downloads:** 65.95k · **Likes:** 88
- **Tags:** Text-to-Video, Transformers, English

LLaVA-NeXT-Video is an open-source multimodal chatbot that achieves strong video understanding through mixed training on video and image data, reaching state-of-the-art results among open-source models on the VideoMME benchmark.

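The llava-hf checkpoints load with stock transformers (4.42 added the LlavaNextVideo classes). A minimal video-QA sketch, sampling frames with PyAV in the style of the model card; the video path is a placeholder:

```python
# Sketch: video question answering with llava-hf/LLaVA-NeXT-Video-7B-hf via
# transformers (>=4.42). The video path is a placeholder.
import av
import numpy as np
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

repo = "llava-hf/LLaVA-NeXT-Video-7B-hf"
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    repo, torch_dtype=torch.float16, device_map="auto"
)
processor = LlavaNextVideoProcessor.from_pretrained(repo)

def sample_frames(path: str, num_frames: int = 8) -> np.ndarray:
    """Decode `num_frames` evenly spaced RGB frames from a video file."""
    container = av.open(path)
    stream = container.streams.video[0]
    keep = set(np.linspace(0, stream.frames - 1, num_frames).astype(int))
    frames = [f.to_ndarray(format="rgb24")
              for i, f in enumerate(container.decode(stream)) if i in keep]
    return np.stack(frames)  # (num_frames, H, W, 3)

conversation = [{"role": "user", "content": [
    {"type": "video"},
    {"type": "text", "text": "What is happening in this video?"},
]}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
clip = sample_frames("/path/to/clip.mp4")  # placeholder path
inputs = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```
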
## Sharegpt4video 8b
- **Maintainer:** Lin-Chen · **License:** Apache-2.0 · **Downloads:** 1,973 · **Likes:** 44
- **Tags:** Text-to-Video, Transformers

ShareGPT4Video-8B is an open-source video chatbot fine-tuned on open-source video instruction data.

## Xclip Base Patch16 Kinetics 600 16 Frames
- **Maintainer:** microsoft · **License:** MIT · **Downloads:** 393 · **Likes:** 2
- **Tags:** Text-to-Video, Transformers, English

X-CLIP is an extension of CLIP for general video-language understanding, supporting zero-shot, few-shot, or fully supervised video classification, as well as video-text retrieval tasks.

## Xclip Base Patch16 Kinetics 600
- **Maintainer:** microsoft · **License:** MIT · **Downloads:** 294 · **Likes:** 1
- **Tags:** Text-to-Video, Transformers, English

X-CLIP is an extended version of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.

## Xclip Large Patch14
- **Maintainer:** microsoft · **License:** MIT · **Downloads:** 1,698 · **Likes:** 11
- **Tags:** Text-to-Video, Transformers, English

X-CLIP is an extension of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.

## Xclip Base Patch16 16 Frames
- **Maintainer:** microsoft · **License:** MIT · **Downloads:** 1,034 · **Likes:** 0
- **Tags:** Text-to-Video, Transformers, English

X-CLIP is a minimal extension of CLIP for general video-language understanding, trained via contrastive learning on (video, text) pairs.

## Xclip Base Patch32 16 Frames
- **Maintainer:** microsoft · **License:** MIT · **Downloads:** 901 · **Likes:** 4
- **Tags:** Text-to-Video, Transformers, English

X-CLIP is an extended version of CLIP for general video-language understanding, trained on (video, text) pairs via contrastive learning, suitable for tasks like video classification and video-text retrieval.

## Xclip Base Patch32
- **Maintainer:** microsoft · **License:** MIT · **Downloads:** 309.80k · **Likes:** 84
- **Tags:** Text-to-Video, Transformers, English

X-CLIP is an extended version of CLIP for general video-language understanding, trained on (video, text) pairs via contrastive learning, suitable for tasks like video classification and video-text retrieval.

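All the X-CLIP entries above share the same transformers API for zero-shot video classification: encode a clip and a set of candidate labels, then softmax the video-text similarity logits. A minimal sketch against this most-downloaded checkpoint, with dummy frames standing in for a real decoded video:

```python
# Sketch: zero-shot video classification with X-CLIP via transformers. This
# checkpoint expects 8 frames per clip; random frames stand in for a real video.
import numpy as np
import torch
from transformers import XCLIPModel, XCLIPProcessor

repo = "microsoft/xclip-base-patch32"
processor = XCLIPProcessor.from_pretrained(repo)
model = XCLIPModel.from_pretrained(repo)

video = list(np.random.randint(0, 255, (8, 224, 224, 3), dtype=np.uint8))  # 8 dummy frames
labels = ["playing basketball", "cooking", "walking a dog"]

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video: similarity of the clip to each candidate text label
probs = outputs.logits_per_video.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```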